Introduction:

This report leverages the mtcars dataset to address two key questions:

  • Comparative Analysis of Transmission Types: Determine whether vehicles with automatic transmissions exhibit better fuel efficiency (measured in miles per gallon, MPG) compared to those with manual transmissions.
  • Quantification of MPG Differences: Precisely measure and analyze the MPG difference between automatic and manual transmissions.

Comparison of Manual and Automatic Transmission Efficiency

1. Exploratory Data Analysis (EDA)

The mtcars dataset contains 32 observations of 11 variables:

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
## NULL
Data Summary
Statistic  mpg     cyl     disp    hp      drat    wt      qsec    vs      am      gear    carb
Min.       10.40   4.000   71.1    52.0    2.760   1.513   14.50   0.0000  0.0000  3.000   1.000
1st Qu.    15.43   4.000   120.8   96.5    3.080   2.581   16.89   0.0000  0.0000  3.000   2.000
Median     19.20   6.000   196.3   123.0   3.695   3.325   17.71   0.0000  0.0000  4.000   2.000
Mean       20.09   6.188   230.7   146.7   3.597   3.217   17.85   0.4375  0.4062  3.688   2.812
3rd Qu.    22.80   8.000   326.0   180.0   3.920   3.610   18.90   1.0000  1.0000  4.000   4.000
Max.       33.90   8.000   472.0   335.0   4.930   5.424   22.90   1.0000  1.0000  5.000   8.000
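
The structure and summary output above can be reproduced with base R. The following is a minimal sketch that assumes only the built-in mtcars data frame; the `transmission` factor is a hypothetical convenience variable, not something from the original analysis. Note that mtcars codes am as 0 for automatic and 1 for manual.

```r
# mtcars ships with base R (datasets package), so nothing needs downloading.
data(mtcars)

# Structure: 32 observations of 11 numeric variables.
str(mtcars)

# Minimum, quartiles, median, mean, and maximum for every column.
print(summary(mtcars))

# am is coded 0 = automatic, 1 = manual (see ?mtcars).
# A labelled factor (hypothetical name) makes group comparisons easier to read.
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))
print(table(transmission))
```

Keeping the labels attached to the 0/1 coding helps avoid mixing up the two groups when results are interpreted later.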

Exploratory Plots

Part 2: Exploratory Models

Model mpg~.

Finding More Models

The methodology for this report follows a systematic approach. The code fits a series of candidate models, adding one variable at a time, and reports each single-variable model alongside the cumulative model built so far. Comparing their statistical outcomes in this preliminary report makes it possible to preview the candidates and identify the model that best fits the data.

Preliminary Report

The table below presents all the models under consideration. Each model in the table was fitted using the R function lm() and evaluated using the corresponding “summary(lm)” output. The models are assessed based on the following criteria:

  • Significant Predictors: The significance of each predictor was determined from the “summary(lm)$coefficients” output, using the p-values associated with the t-statistics (the Pr(>|t|) column). A predictor is considered significant if its p-value is less than 0.05, meaning its coefficient is statistically distinguishable from zero at the 5% level; this does not by itself measure the size of the effect. The table shows the number of significant predictors for each model.
  • Adjusted R-Squared: The adjusted R-squared value, obtained from “summary(lm)$adj.r.squared”, represents the proportion of variance in the dependent variable explained by the independent variables, adjusted for the number of predictors. A higher adjusted R-squared value indicates a better fit after accounting for model complexity.
  • AIC and BIC: The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were calculated using the R functions AIC(lm) and BIC(lm), respectively. Both criteria balance goodness of fit against model complexity, and lower values are better. AIC adds a penalty of 2 per estimated parameter, whereas BIC adds log(n) per parameter (about 3.47 for n = 32), so BIC is stricter and favors simpler models. This report selects the four models with the highest adjusted R-squared as the preferred models, preferring lower AIC and BIC values among them.
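
The criteria above can be collected into a comparison table along the following lines. This is a sketch rather than the report's actual code: `model_row` and `models_table` are hypothetical names, and the significant-predictor count here excludes the intercept, which is an assumption and may not match the counting used in the table below.

```r
data(mtcars)

# Hypothetical helper: fit one formula and collect the comparison metrics.
model_row <- function(formula_text) {
  fit <- lm(as.formula(formula_text), data = mtcars)
  coefs <- summary(fit)$coefficients
  # Assumption: count slope terms with p < 0.05, excluding the intercept.
  p_values <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]
  data.frame(
    Model = formula_text,
    Adj_R_Squared = summary(fit)$adj.r.squared,
    AIC = AIC(fit),
    BIC = BIC(fit),
    SignificantPredictors = sum(p_values < 0.05)
  )
}

predictors <- c("cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear")
rows <- list()
for (i in seq_along(predictors)) {
  # Single-predictor model for the i-th variable.
  rows[[length(rows) + 1]] <- model_row(paste("mpg ~", predictors[i]))
  # Cumulative model with the first i predictors (for i = 1 it would
  # duplicate the single-predictor model, so it is skipped).
  if (i > 1) {
    cumulative <- paste(predictors[1:i], collapse = " + ")
    rows[[length(rows) + 1]] <- model_row(paste("mpg ~", cumulative))
  }
}
models_table <- do.call(rbind, rows)
models_table$ModelNumber <- seq_len(nrow(models_table))

# Four candidates with the highest adjusted R-squared.
top4 <- head(models_table[order(-models_table$Adj_R_Squared), ], 4)
print(top4)
```

Sorting by adjusted R-squared first and inspecting AIC and BIC afterwards keeps the selection rule explicit; a different rule (for example, lowest BIC) would pick different candidates.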
Models Table
ModelNumber  Adj_R_Squared  AIC       BIC       SignificantPredictors  Model
1            0.7170527      169.3064  173.7036  1                      mpg ~ cyl
2            0.7089548      170.2094  174.6066  1                      mpg ~ disp
3            0.7429841      167.1456  173.0086  2                      mpg ~ cyl + disp
4            0.5891853      181.2386  185.6358  1                      mpg ~ hp
5            0.7430186      168.0184  175.3471  1                      mpg ~ cyl + disp + hp
6            0.4461283      190.7999  195.1971  1                      mpg ~ drat
7            0.7502914      167.9360  176.7304  1                      mpg ~ cyl + disp + hp + drat
8            0.7445939      166.0294  170.4266  1                      mpg ~ wt
9            0.8227219      157.7659  168.0260  2                      mpg ~ cyl + disp + hp + drat + wt
10           0.1478062      204.5881  208.9853  1                      mpg ~ qsec
11           0.8199798      159.0020  170.7279  1                      mpg ~ cyl + disp + hp + drat + wt + qsec
12           0.4223126      192.1471  196.5443  1                      mpg ~ vs
13           0.8126278      160.9766  174.1682  1                      mpg ~ cyl + disp + hp + drat + wt + qsec + vs
14           0.3384589      196.4844  200.8816  1                      mpg ~ am
15           0.8218062      160.0075  174.6648  1                      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
16           0.2050292      202.3638  206.7611  1                      mpg ~ gear
17           0.8149224      161.7979  177.9210  1                      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear

4 & 5. Model Diagnostics and Model Validation

Selected Models by Highest adj_R_squared
ModelNumber  Adj_R_Squared  AIC       BIC       SignificantPredictors  Model
9            0.8227219      157.7659  168.0260  2                      mpg ~ cyl + disp + hp + drat + wt
11           0.8199798      159.0020  170.7279  1                      mpg ~ cyl + disp + hp + drat + wt + qsec
15           0.8218062      160.0075  174.6648  1                      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
17           0.8149224      161.7979  177.9210  1                      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear

Analysis

  • ANOVA: R's anova() function produces analysis of variance tables for one or more fitted model objects. For each model the table reports the Residual Sum of Squares (RSS), a metric that measures the variation in the data not explained by the model, computed as the sum of squared differences between the observed and predicted values. A small RSS indicates a good fit between the model and the data, suggesting that most of the variation is explained by the factors included in the model. For nested models, the F-test in the table indicates whether the added predictors reduce the RSS by more than chance would explain.

  • Variance Inflation Factor (VIF): VIF values help to identify multicollinearity among predictors. Multicollinearity occurs when two or more predictors are highly correlated, leading to unstable estimates of the regression coefficients. Values above 10 are commonly taken to indicate problematic multicollinearity.

  • Root Mean Squared Error, RMSE is a measure of how well the model’s predictions match the actual values. Lower RMSE indicates better model performance.

  • Residual Plots: Diagnostic plots help to check the assumptions of linear regression, including linearity, homoscedasticity, and normality of residuals.
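
The first three diagnostics listed above can be sketched in base R as follows. The helpers `manual_vif` and `rmse` are hypothetical names introduced for illustration. The VIF is computed by hand as 1 / (1 - R²) from regressing each predictor on the others, an assumption made to keep the example free of extra packages; the report's own output may have come from a packaged VIF function.

```r
data(mtcars)

# The four selected models, each nested in the next.
fit1 <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
fit2 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
fit3 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am, data = mtcars)
fit4 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear,
           data = mtcars)
fits <- list(fit1, fit2, fit3, fit4)

# Sequential F-tests between nested models; the RSS column is the
# residual sum of squares of each model.
anova_table <- anova(fit1, fit2, fit3, fit4)
print(anova_table)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all remaining predictors.
manual_vif <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(name) {
    others <- X[, colnames(X) != name, drop = FALSE]
    r2 <- summary(lm(X[, name] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical helper: RMSE as sqrt(RSS / n), with no
# degrees-of-freedom correction.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

vifs <- lapply(fits, manual_vif)
rmses <- sapply(fits, rmse)
print(vifs)
print(rmses)
```

Because this RMSE divides by n rather than by the residual degrees of freedom, it can only decrease as predictors are added to a nested sequence, so it is best read together with the F-tests rather than on its own.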

## [1] "ANOVA"
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
## Model 4: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 167.43                           
## 2     25 163.48  1    3.9493 0.5875 0.4516
## 3     23 148.87  2   14.6040 1.0862 0.3549
## 4     22 147.90  1    0.9717 0.1445 0.7075
## [1] "Variance Indicator Factor VIF"
## [[1]]
##       cyl      disp        hp      drat        wt 
##  7.869010 10.463957  3.990380  2.662298  5.168795 
## 
## [[2]]
##       cyl      disp        hp      drat        wt      qsec 
##  9.958978 10.550573  5.357783  2.966519  7.181690  4.039701 
## 
## [[3]]
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 13.347224 10.646573  5.931238  3.122224  7.599975  6.635692  4.923095  4.162232 
## 
## [[4]]
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 14.573542 11.783934  7.105430  3.230897  7.838669  6.984654  4.923203  4.630597 
##      gear 
##  4.392711
## [1] "RMSE"
## [[1]]
## [1] 2.287371
## 
## [[2]]
## [1] 2.260232
## 
## [[3]]
## [1] 2.156913
## 
## [[4]]
## [1] 2.149863

6 Corrections to Model Diagnostics

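In the analysis of variance, none of the larger models improves significantly on Model 1 (all p-values exceed 0.05), so at first glance Model 1 seems to be the preferred model. However, cyl and disp show high multicollinearity: the VIF of disp exceeds 10 in all four models, and the VIF of cyl exceeds 10 in Models 3 and 4. The RMSE changes little as predictors are added (from 2.29 to 2.15), which is consistent with the added predictors carrying largely redundant information. Additionally, the high RSS values suggest that a better-fitting model may exist. To address the issue, cyl and disp are removed from all four models, and the results are compared with the first set.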

## [1] "ANOVA"
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp + drat + wt
## Model 2: mpg ~ hp + drat + wt + qsec
## Model 3: mpg ~ hp + drat + wt + qsec + vs + am
## Model 4: mpg ~ hp + drat + wt + qsec + vs + am + gear
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 183.68                           
## 2     27 174.10  1    9.5784 1.4499 0.2403
## 3     25 158.56  2   15.5448 1.1765 0.3255
## 4     24 158.56  1    0.0035 0.0005 0.9819
## [1] "Variance Indicator Factor VIF - Corrected Models"
## [[1]]
##       hp     drat       wt 
## 1.769308 2.033837 2.869445 
## 
## [[2]]
##       hp     drat       wt     qsec 
## 4.921958 2.035473 3.582683 2.876115 
## 
## [[3]]
##       hp     drat       wt     qsec       vs       am 
## 5.070665 2.709905 5.105979 5.776361 4.120656 3.272177 
## 
## [[4]]
##       hp     drat       wt     qsec       vs       am     gear 
## 5.364885 3.028679 5.135893 5.794930 4.253778 4.257110 3.452507
## [1] "RMSE-Corrected Models"
## [[1]]
## [1] 2.395842
## 
## [[2]]
## [1] 2.332538
## 
## [[3]]
## [1] 2.225974
## 
## [[4]]
## [1] 2.225949

Despite the lack of significant improvement, the remaining predictors (hp, wt, and qsec) hint at a physical relationship: power is the rate at which work (force times distance) is done over time.

Proposed Models

Exclusion of the variables “cyl” and “disp” does not result in significant improvements in model performance compared to the initial set of models.

Given that power is taken here to be the gross horsepower of the car (one metric horsepower is about 735.5 watts, and 1 watt = 1 kg·m²/s³), which reflects the relationship between mass, distance, and time, two new models have been introduced. These additions aim to address the multicollinearity observed in the previous models.

  1. mpg~(wt+qsec)
  2. mpg~(qsec+hp)
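
As a rough check on the two proposed models, the sketch below fits both and computes their VIF and RMSE. It relies on the fact that, with exactly two predictors, both VIFs equal 1 / (1 - r²), where r is the correlation between the predictors. The names `vif_pair`, `rmse`, and `comparison` are hypothetical. The AIC column is an addition for illustration, since the two models are not nested and an F-test between them is not meaningful.

```r
data(mtcars)

fit_wt_qsec <- lm(mpg ~ wt + qsec, data = mtcars)
fit_qsec_hp <- lm(mpg ~ qsec + hp, data = mtcars)

# With two predictors, each VIF reduces to 1 / (1 - r^2),
# where r is the correlation between the two predictors.
vif_pair <- function(x, y) 1 / (1 - cor(x, y)^2)

# RMSE as sqrt(RSS / n), with no degrees-of-freedom correction.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

comparison <- data.frame(
  Model = c("mpg ~ wt + qsec", "mpg ~ qsec + hp"),
  VIF = c(vif_pair(mtcars$wt, mtcars$qsec),
          vif_pair(mtcars$qsec, mtcars$hp)),
  RMSE = c(rmse(fit_wt_qsec), rmse(fit_qsec_hp)),
  # Assumption: AIC is used here as one common way to compare
  # non-nested models fitted to the same response.
  AIC = c(AIC(fit_wt_qsec), AIC(fit_qsec_hp))
)
print(comparison)
```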

Diagnostics for the New Models

## [1] "ANOVA"
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt
## Model 2: mpg ~ cyl + disp + hp + drat + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
## Model 4: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
## Model 5: mpg ~ (wt + qsec)
## Model 6: mpg ~ (qsec + hp)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 167.43                           
## 2     25 163.48  1     3.949 0.5875 0.4516
## 3     23 148.87  2    14.604 1.0862 0.3549
## 4     22 147.90  1     0.972 0.1445 0.7075
## 5     29 195.46 -7   -47.563 1.0107 0.4507
## 6     29 408.89  0  -213.430
## [1] "Variance Indicator Factor VIF - All Models"
## [[1]]
##       cyl      disp        hp      drat        wt 
##  7.869010 10.463957  3.990380  2.662298  5.168795 
## 
## [[2]]
##       cyl      disp        hp      drat        wt      qsec 
##  9.958978 10.550573  5.357783  2.966519  7.181690  4.039701 
## 
## [[3]]
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 13.347224 10.646573  5.931238  3.122224  7.599975  6.635692  4.923095  4.162232 
## 
## [[4]]
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 14.573542 11.783934  7.105430  3.230897  7.838669  6.984654  4.923203  4.630597 
##      gear 
##  4.392711 
## 
## [[5]]
##       hp     drat       wt 
## 1.769308 2.033837 2.869445 
## 
## [[6]]
##       hp     drat       wt     qsec 
## 4.921958 2.035473 3.582683 2.876115 
## 
## [[7]]
##       hp     drat       wt     qsec       vs       am 
## 5.070665 2.709905 5.105979 5.776361 4.120656 3.272177 
## 
## [[8]]
##       hp     drat       wt     qsec       vs       am     gear 
## 5.364885 3.028679 5.135893 5.794930 4.253778 4.257110 3.452507 
## 
## [[9]]
##       wt     qsec 
## 1.031487 1.031487 
## 
## [[10]]
##     qsec       hp 
## 2.006342 2.006342
## [1] "RMSE- All Models"
## [[1]]
## [1] 2.287371
## 
## [[2]]
## [1] 2.260232
## 
## [[3]]
## [1] 2.156913
## 
## [[4]]
## [1] 2.149863
## 
## [[5]]
## [1] 2.471485
## 
## [[6]]
## [1] 3.574623

Final Model

It’s important to note that the ANOVA results align with the Q-Q residual plots for each model: adding predictors yields little statistically significant improvement, and in most plots the tails deviate noticeably from the Q-Q reference line. Because Models 5 and 6 are not nested within Models 1 to 4, the ANOVA rows for them should be read with caution. Model 5 (mpg ~ wt + qsec) nevertheless shows a distinct pattern in its Q-Q plot, where the residuals eventually return to the line. Model 5 also shows the least multicollinearity according to the VIF results (about 1.03 for both predictors, listed as entry [[9]] in the VIF output), although its RMSE is somewhat higher than that of Models 1 to 4 (2.47 versus 2.15 to 2.29). Therefore, Model 5 is used to make predictions and answer the questions posed in this report.

Response to project question 1

Now that the linear model is established, it’s beneficial to visualize the data before making predictions.

Based on the graph, there’s minimal difference in MPG between the two transmission types; however, manual transmissions tend to have slightly higher MPG. While it might be tempting to simply observe where the two groups intersect to determine the difference, this report aims to analyze the overall trend. To achieve this, R’s predict() function is applied to the fitted model for each transmission subset. The difference in the mean predicted MPG between the two subsets quantifies the fuel efficiency gap between manual and automatic cars.
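
The prediction step described above can be sketched as follows, assuming the final model is mpg ~ wt + qsec fitted on the full dataset. The variable names are hypothetical. In mtcars, am == 0 marks automatic and am == 1 marks manual cars, so a positive difference here means the manual group has the higher mean predicted MPG.

```r
data(mtcars)

# Final model: weight and quarter-mile time as predictors.
final_fit <- lm(mpg ~ wt + qsec, data = mtcars)

# Split by transmission type (0 = automatic, 1 = manual).
automatic <- subset(mtcars, am == 0)
manual <- subset(mtcars, am == 1)

# Predicted MPG for each car in each subset.
pred_automatic <- predict(final_fit, newdata = automatic)
pred_manual <- predict(final_fit, newdata = manual)

# Difference in mean predicted MPG: manual minus automatic.
mean_difference <- mean(pred_manual) - mean(pred_automatic)
print(mean_difference)
```

Since am is not a predictor in this model, the gap between the two groups comes from their differences in weight and quarter-mile time rather than from a separately estimated transmission effect.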

Prediction

## [1] "Automatic Transmission"
##      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
##           21.580569           18.196114           21.068588           16.443423 
##           Merc 240D            Merc 230            Merc 280           Merc 280C 
##           22.227120           25.123713           19.385488           19.943006 
##          Merc 450SE          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##           15.368981           17.271134           17.390414            9.951297 
## Lincoln Continental   Chrysler Imperial       Toyota Corona    Dodge Challenger 
##            8.924276            8.951388           25.896199           17.652896 
##         AMC Javelin          Camaro Z28    Pontiac Firebird 
##           18.481530           14.680913           16.179557
## [1] "Manual Transmission"
##      Mazda RX4  Mazda RX4 Wag     Datsun 710       Fiat 128    Honda Civic 
##       21.81511       21.04822       25.32728       26.73215       28.80248 
## Toyota Corolla      Fiat X1-9  Porsche 914-2   Lotus Europa Ford Pantera L 
##       28.97422       27.54022       24.46115       27.81207       17.21749 
##   Ferrari Dino  Maserati Bora     Volvo 142E 
##       20.16588       15.29122       22.99592
## [1] "Mean Difference"
## [1] 6.089752

Results:

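Manual transmissions are more fuel efficient than automatic transmissions in this sample: the mean predicted MPG for manual cars exceeds that for automatic cars by about 6.09 mpg (a difference, not a factor). This agrees with the exploratory plot, where manual cars tend to have the higher MPG. Because Model 5 does not include am as a predictor, the gap reflects the differences in weight and quarter-mile time between the two groups rather than an isolated effect of the transmission type.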